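# Create the output directory (if it does not exist) and work inside it
mkdir -p fastq
cd fastq
# Read one run ID per line from ids.txt and extract the chr21 reads as FASTQ
while read -r id
do
    sam-dump \
        --verbose \
        --fastq \
        --aligned-region chr21 \
        --output-file "${id}_chr21.fastq" \
        "${id}"
done < ../ids.txt
cd ..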
The download may take a long time, depending on the speed of the Internet connection and
the computer's memory and processors. The script above creates the “fastq” directory
and saves in it the chromosome 21 FASTQ files of 13 human individuals.
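Once the download finishes, it can be worth checking that each file actually contains reads. The following is a minimal sketch, not part of the original pipeline: the helper name count_reads is hypothetical, and it assumes each FASTQ record occupies exactly four lines, which may not hold for every sam-dump output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: approximate the number of reads in one FASTQ file,
# assuming the usual four lines per record.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Summarize every chromosome 21 FASTQ file, if any have been downloaded.
for f in fastq/*_chr21.fastq; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    printf '%s\t%s reads\n' "$f" "$(count_reads "$f")"
done
```

A file that reports zero reads probably points to an interrupted download or a run with no alignments on chr21, and that run may need to be downloaded again.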
4.2.2.2.2 The reference genome
The FASTA sequence of the reference genome is required for read mapping. We can
download it from a reliable database such as the NCBI Genome database. However, for
the GATK pipeline, the sequence of the human genome can be downloaded from the GATK
resource bundle, which is a collection of standard files prepared for use with GATK.
The resource bundle is hosted on a Google Cloud bucket and can be accessed with a Google
account at the following address:
https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/
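The address above is a web console URL. Command-line tools for Google Cloud Storage usually expect a gs:// URI instead. As a rough sketch, the conversion can be done by stripping the console prefix. The function name console_to_gs is hypothetical, and the sketch assumes the URL follows the console pattern shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a Cloud Console browser URL into a gs:// URI
# by removing the console prefix (assumes the URL starts with that prefix).
console_to_gs() {
    local path=${1#https://console.cloud.google.com/storage/browser/}
    echo "gs://${path}"
}

bundle=$(console_to_gs "https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/")
echo "$bundle"
```

With the Google Cloud SDK installed and authenticated (an assumption, since the text does not cover its setup), the bundle contents could then be listed with gsutil ls "$bundle" before copying the files that are needed.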
TABLE 4.2 The NCBI SRA Run IDs and Individual Information
Run ID        Country     Population    Gender
ERR1019055    China       East Asia     Female
ERR1019056    China       East Asia     Male
ERR1019057    China       East Asia     Female
ERR1019081    Pakistan    South Asia    Male
ERR1025616    Pakistan    South Asia    Male
ERR1019044    Kenya       Africa        Female
ERR1025600    Kenya       Africa        Female
ERR1025621    Nigeria     Africa        Male
ERR1025640    Senegal     Africa        Male
ERR1019034    Russia      Europe        Male
ERR1019045    France      Europe        Male
ERR1019068    Italy       Europe        Male
ERR1025614    Bulgaria    Europe        Male